Red Wine by Nadim Kawwa

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Take a quick look at the data. At a glance we can tell that most wines are acidic (pH ranges from 2.74 to 4.01), and the mean alcohol content (10.42) is close to the median (10.20). The section below will plot these variables individually and interpret the output.
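
The mean-versus-median comparison can be repeated for several columns at once. The snippet below is a rough sketch rather than part of the original analysis: it copies (mean, median) pairs from the summary table and reports the gap as a fraction of the median. The 1% cut-off for calling a variable "roughly symmetric" is an arbitrary assumption, and the gap is only a heuristic hint at skew, not a test.

```python
# (mean, median) pairs copied from the summary table above.
summary = {
    "fixed.acidity": (8.32, 7.90),
    "residual.sugar": (2.539, 2.200),
    "total.sulfur.dioxide": (46.47, 38.00),
    "alcohol": (10.42, 10.20),
    "pH": (3.311, 3.310),
    "density": (0.9967, 0.9968),
}

def relative_gap(mean, median):
    """Gap between mean and median as a fraction of the median."""
    return (mean - median) / median

gaps = {name: relative_gap(mean, median) for name, (mean, median) in summary.items()}

# Largest gaps first. The 1% cut-off is an arbitrary assumption for illustration.
for name, gap in sorted(gaps.items(), key=lambda kv: -abs(kv[1])):
    if gap > 0.01:
        hint = "mean above median (possible right tail)"
    elif gap < -0.01:
        hint = "mean below median (possible left tail)"
    else:
        hint = "roughly symmetric"
    print(f"{name:22s} {gap:+.2%}  {hint}")
```

A mean well above the median usually points to a long right tail, which is worth checking against the histograms.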

Univariate Plots Section

Start by plotting parameters in order to see their distribution.

The quality is between 3 and 8, with most of the wines rated a 5 or a 6. It therefore makes sense that the average quality falls in that range, as shown by the red line.

The pH looks normally distributed. For this first part we’re not eliminating any outliers. Note that there appears to be an outlier with a pH close to 1 (is that wine or water?), although the summary above reports a minimum pH of 2.74.

## Warning: Removed 8 rows containing non-finite values (stat_bin).

Alcohol content is clearly skewed to the right, with higher alcohol contents being the rarest. Note that for alcohol content we don’t know how large the wine bottle (or keg?) is.

The fixed acidity looks skewed to the right: the summary above shows a mean (8.32) above the median (7.90) and a maximum of 15.90. We also notice that it has a single peak. For a better visualisation let’s go to the next plot.

To take another look at fixed acidity, draw a box plot and overlay the scattered points of the acidity values on top of it. Most acidity values are between 7 and 9.
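
As a rough check on what the box plot should show, the box and the outlier cut-offs can be computed from the summary table alone. This sketch assumes the common 1.5 × IQR convention for flagging outliers; the plotting settings used for the original figure are not stated.

```python
# Five-number summary of fixed.acidity, copied from the summary table above.
minimum, q1, median, q3, maximum = 4.60, 7.10, 7.90, 9.20, 15.90

iqr = q3 - q1                   # interquartile range: the height of the box
lower_fence = q1 - 1.5 * iqr    # points below this would be drawn as outliers
upper_fence = q3 + 1.5 * iqr    # points above this would be drawn as outliers

print(f"box spans {q1:.2f} to {q3:.2f} (IQR = {iqr:.2f})")
print(f"fences = [{lower_fence:.2f}, {upper_fence:.2f}]")
print("low outliers present: ", minimum < lower_fence)
print("high outliers present:", maximum > upper_fence)
```

Under that assumption the box covers roughly 7 to 9, and only the high end of the distribution produces outliers.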

Density is another metric worth examining. The plot shows that the density is roughly normally distributed, and that most wines have a density more or less equal to that of water.

Citric acid is what gives wine its sour taste. We don’t see much of a trend in the distribution.

Residual sugar content is clearly skewed to the right with a peak at 2.2. Are most wines on the lower sugar content side?

Chlorides are what give the wine its salty taste. The distribution is skewed to the right, with most wines having very little salt content. The long tail lies between 0.2 and 0.6 and was omitted from this plot.

Volatile acidity is what gives the wine its vinegar taste. We can tell from the plot that the distribution is bimodal, with one peak at 0.4 and another at 0.6.

Above 50 ppm, SO2 becomes detectable by smell. For water, 1 ppm = 1 mg/L, with 1 liter = 1 dm^3. Let’s assume wine is like water in this respect since their densities are almost equal. The distribution is skewed to the right, and fewer than half of the wines contain enough SO2 to be detected by smell.
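
The size of the error introduced by treating wine like water can be estimated with a short calculation. This is a sketch under two assumptions that the dataset does not state explicitly: that ppm means milligrams per kilogram, and that the density column is in g/cm^3 (numerically equal to kg/L). The function name is made up for illustration.

```python
def ppm_to_mg_per_l(ppm, density_kg_per_l):
    """Convert a mass fraction in ppm (mg per kg) to mg per litre."""
    return ppm * density_kg_per_l

WATER_DENSITY = 1.0     # kg/L, the idealised value assumed in the text
WINE_DENSITY = 0.9967   # mean density from the summary table, units assumed

threshold_ppm = 50      # smell threshold quoted in the text
in_water = ppm_to_mg_per_l(threshold_ppm, WATER_DENSITY)
in_wine = ppm_to_mg_per_l(threshold_ppm, WINE_DENSITY)
relative_difference = (in_water - in_wine) / in_water

print(f"{threshold_ppm} ppm in water: {in_water:.3f} mg/L")
print(f"{threshold_ppm} ppm in wine:  {in_wine:.3f} mg/L")
print(f"relative difference: {relative_difference:.2%}")
```

With these numbers the two conversions differ by well under one percent, so the approximation is harmless for reading the plot.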

Univariate Analysis

The dataset contains a collection of parameters measured from red wine samples. The advantage of these observations is that they are objective, quantitative measurements. For example, saltiness is measured as chloride concentration rather than in subjective terms (salty, not very salty, etc.). We will try to see if there is a correlation between these features and the quality of the wine. It will also be of interest to see if there is a correlation among the parameters themselves. Among the parameters, only citric acid content did not show a distinct pattern. Some of the plots with longer tails were trimmed in order to better show the distribution near the peak. At this point no new variables were created.

Bivariate Plots Section

We took four arbitrary parameters that should influence quality. The strongest correlation with quality is for alcohol content. pH seems to have almost no effect on quality, which sounds counterintuitive. For all wines the quality of the soil definitely influences the quality of the grapes. pH and citric acid have a negative correlation, which makes sense: the more acidic the solution, the lower the pH.

Try plotting alcohol vs quality and fitting a linear regression model. As expected the slope is positive. This excludes the top 5% and bottom 5% of alcohol content.
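
The trimming and fitting steps can be sketched in a few lines. The data points below are hypothetical and are NOT taken from the dataset. The helper names are invented for illustration, and the percentile rule (linear interpolation) is an assumption that may differ from what the original plot used.

```python
def percentile(sorted_values, p):
    """Linearly interpolated percentile of an already sorted list (0 <= p <= 1)."""
    position = p * (len(sorted_values) - 1)
    low = int(position)
    high = min(low + 1, len(sorted_values) - 1)
    fraction = position - low
    return sorted_values[low] + fraction * (sorted_values[high] - sorted_values[low])

def trimmed(pairs, lower=0.05, upper=0.95):
    """Keep only (x, y) pairs whose x lies inside the given percentile band."""
    xs = sorted(x for x, _ in pairs)
    lo, hi = percentile(xs, lower), percentile(xs, upper)
    return [(x, y) for x, y in pairs if lo <= x <= hi]

def least_squares(pairs):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical (alcohol %, quality) pairs for illustration only.
toy = [(8.4, 3), (9.0, 5), (9.4, 5), (9.5, 5), (9.8, 5), (10.0, 6), (10.2, 5),
       (10.5, 6), (10.8, 6), (11.0, 6), (11.2, 6), (11.5, 7), (11.8, 6),
       (12.0, 7), (12.5, 7), (12.8, 7), (13.0, 7), (13.4, 8), (14.0, 8), (14.9, 5)]

kept = trimmed(toy)
slope, intercept = least_squares(kept)
print(f"kept {len(kept)} of {len(toy)} points; slope = {slope:.3f}")
```

Trimming on the x variable before fitting keeps a handful of extreme alcohol values from dominating the slope, at the cost of saying nothing about those wines.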

This box plot representation of the same data offers better insight. We can guess that most wines have a quality of 5 or 6, as indicated by the concentration of data points. More noticeable is that the higher quality wines (score 7-8) have a higher alcohol content. However we can’t ignore one pertinent anomaly: wines rated 5 have a lower mean alcohol content than those rated 3-4. It would have been convenient to draw a neat line connecting the means, but that doesn’t seem plausible.

Contrary to the names, these two are not inversely related. Take out the top and bottom 5% and what you get is the plot above.

Clearly there is no correlation here; the points are arranged in a homogeneous fashion. Adding a linear regression shows a horizontal line, which again confirms our observation. One shortcoming is that omitting the top 10% sweetest wines leaves out dessert wines. We will plot again for the sweetest wines.

Not much of a difference here. Turns out the sweetness of the wine is irrelevant for us.

By omitting the more pungent wines (SO2 > 150ppm), we can see that quality and total sulfur dioxide are negatively correlated. This uses a second order polynomial fit, although the relationship seems linear. SO2 has a suffocating smell and higher concentrations negatively impact the quality of the wine.

There seems to be a negative correlation between volatile acidity and wine quality. Since volatile acidity is mostly smaller than 1, apply a square root transform to obtain larger values and maybe spread out the scatter.
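
To see what the square root does to values in this range, the transform can be applied to the five-number summary of volatile acidity from the table at the top. This is only a sketch on summary values, not on the full column.

```python
import math

# Five-number summary of volatile.acidity, copied from the summary table above.
values = {"min": 0.12, "q1": 0.39, "median": 0.52, "q3": 0.64, "max": 1.58}

transformed = {label: math.sqrt(v) for label, v in values.items()}

for label, v in values.items():
    print(f"{label:6s} {v:.2f} -> {transformed[label]:.3f}")

# Width of the middle half of the data before and after the transform.
iqr_before = values["q3"] - values["q1"]
iqr_after = transformed["q3"] - transformed["q1"]

# Width of the full range before and after the transform.
range_before = values["max"] - values["min"]
range_after = transformed["max"] - transformed["min"]

print(f"IQR:   {iqr_before:.3f} -> {iqr_after:.3f}")
print(f"range: {range_before:.3f} -> {range_after:.3f}")
```

Values below 1 do get larger, while the maximum above 1 is pulled in. The slope of the square root exceeds 1 only below 0.25, so gaps between points widen only at the very low end. On these summary values the transform mainly shortens the right tail rather than spreading out the bulk of the points.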

Bivariate Analysis

At this point it is worth moving into multivariate analysis. We will explore alcohol content, volatile acidity, and total sulfur dioxide. These three appear to be the strongest in terms of correlation. The strongest relationship for determining the quality of a wine appears to be alcohol content. What is missing from the above plots is one or more variables with a correlation coefficient larger than 0.7, the threshold used here for a strong correlation.
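
For reference, the correlation coefficient and the 0.7 cut-off can be written out as follows. This is a minimal sketch of the Pearson coefficient with invented helper names. The only number taken from this report is the 0.476 quoted later for alcohol versus quality, and the small lists are hypothetical sanity checks, not wine data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def strength(r, threshold=0.7):
    """Label a coefficient using the 0.7 rule of thumb adopted in this report."""
    return "strong" if abs(r) >= threshold else "not strong"

# Hypothetical sanity checks: a perfect line and its mirror image.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # perfectly increasing
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # perfectly decreasing

# The alcohol-vs-quality coefficient quoted in this report.
print(strength(0.476))
```

Note that the threshold applies to the absolute value, so a strongly negative coefficient would count as strong too.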

Multivariate Plots Section

Not much of a discernible color pattern here. This suggests that SO2 concentration isn’t significant when coupled with alcohol content.

The plot above colored by citric acid content shows more. The higher quality wines are more brightly colored, which suggests that citric acid coupled with alcohol content makes for a higher quality wine.

There seems to be a positive correlation between alcohol and pH value. That being said, most wines have a pH between 3 and 3.5. If we look at the linear fits for wines with a quality rating of 7 to 8, they are on the lower side of pH value. Conversely, wines with a poor 3-4 quality are in the upper part of the plot, signaling a higher pH value. The oddities are at the extremities of the fitted models. For qualities 3 to 6 the linear models overlap at low alcohol content (less than 10%). For higher alcohol contents the qualities 6-8 seem to converge.

Multivariate Analysis

There seem to be two major factors that influence quality based on the above: alcohol content and citric acid concentration. Even so, the parameters are not sufficiently impactful to determine the quality of the red wine on their own. There might be other factors at play such as the taste of the people giving the rating.


Final Plots and Summary

Plot One

Description One

The table above shows a correlation between arbitrary pairs. We can tell at a glance that alcohol and citric acid will play a more significant role in predicting the quality of the wine.

Plot Two

Description Two

The alcohol content is plotted against quality in the form of a box plot. This puts into clearer perspective the 0.476 correlation coefficient seen in Plot One. We can infer that a higher alcohol content means a better quality. However we can’t ignore the dip in alcohol content seen at wines with a score of 5. Does alcohol within a certain range interact badly with other elements that give wine its taste?

Plot Three

Description Three

The third plot, which I thought would be interesting, is pH vs alcohol colored by quality. The first things to notice are the oddities. Wines ranging from 3 to 6 quality are given the same linear fit; those with qualities 6 to 8 appear to converge towards the end. Looking away from the edges, the higher quality wines have a low pH and the poorer quality wines have a high pH. This observation is based on the linear models fitted to the plot and may not be entirely accurate. We don’t believe that there is a large enough number of outliers to skew the fit.


Reflection

The dataset contains a large sample of wines with parameters that can be objectively measured. Throughout the study an attempt was made to find out whether there is a strong correlation between quality and any one of these parameters. Surprisingly, none of the parameters had a correlation coefficient larger than 0.7, the arbitrary value beyond which a strong correlation is evident. There are some limitations to the dataset, beginning with the identity of the graders. A wine expert is supposedly a sommelier, and there are levels to that qualification: does one grader understand more about quality than the others? There is also the limitation of geographic origin. However, what would be more insightful is to take a single wine producer and track the quality of their vintages over time.